Introduction
Recently, deep neural networks have attracted many researchers in computer vision, speech recognition, natural language processing, and related fields. In particular, the convolutional neural network (CNN) has achieved great success in image recognition [1]. After the deep CNN by Krizhevsky et al. won ILSVRC 2012 with a much higher score than the conventional methods, it became the fundamental technique for image classification. To improve the recognition accuracy further, deeper and more complex network architectures have been proposed [2]–[6].
Several methods have been proposed to improve or speed up the training of deep neural networks. LeCun et al. [9] and Wiesler et al. [10] pointed out that the learning of a neural network converges faster if its inputs are whitened. Whitening the inputs of each layer makes the changes of the inputs uniform and can remove the bad effects of the internal covariate shift. However, whitening the inputs of each layer is computationally expensive, because it requires calculating the covariance matrix of the full training samples and solving the eigenvalue problem of that covariance matrix.
To simplify the computation of the whitening, Ioffe et al. [11] proposed a method called Batch Normalization that uses only dimension-wise normalization over the training samples in a mini-batch. By this method, the mean of the inputs of each layer is standardized to zero and the variance to one. Batch Normalization is now considered the standard method to improve the learning of deep neural networks.
Another approach to improve the learning of deep neural networks is to modify the activation function of the hidden neurons. Historically, the sigmoid function has often been used as the activation function of the neurons in artificial neural networks. However, the standard activation functions such as the sigmoid function or the hyperbolic tangent function are contractive almost everywhere, and their gradients at large input values become almost zero. This makes the updates by stochastic gradient descent very small. This problem is known as the vanishing gradient problem.
To improve the restricted Boltzmann Machine, Nair et al. [12] introduced the rectified linear unit (ReLU). Glorot et al. [13] showed that the ReLU activation function in the hidden layers could improve the learning speed of various deep neural networks. The gradient of the ReLU activation function at positive values is constant and does not vanish. This means that the vanishing gradient problem can be avoided by using the ReLU activation function, which is why it can improve the learning speed of deep neural networks. The ReLU is now used as the standard activation function for deep neural networks.
Further improvements of the activation function have been proposed in the literature. Recently, Clevert et al. [14] proposed an activation function with negative outputs named the Exponential Linear Unit (ELU). The ELU activation function has the same shape as the ReLU for positive inputs, but for negative inputs the ELU outputs negative values while the ReLU always outputs zero. These negative outputs push the mean of the outputs of the activation function toward zero and can reduce the undesired bias.
From the early work on the cat's visual cortex by Hubel and Wiesel [15], it is well known that the neurons in the visual cortex are sensitive to small regions of the visual field (called receptive fields). These small regions are tiled to cover the entire visual field, and each neuron acts as a local filter over the input visual stimulus. The deep CNN inherits this structure of the visual processing in the brain, but usually the parameters (weights) of the receptive fields are trained by supervised learning.
Olshausen and Field showed the importance of sparse coding for the self-organization of the receptive fields of simple cells in the primary visual cortex (V1) [16], [17]. They demonstrated that a complete family of localized, oriented, bandpass receptive fields, similar to those found in the primary visual cortex, can be obtained by a learning algorithm that attempts to find sparse linear codes for natural scenes. They also showed that the resulting sparse code possesses a higher degree of statistical independence and can provide a more efficient representation for the later stages of visual processing. These results suggest the importance of unsupervised learning for the self-organization of the receptive fields. The unsupervised learning algorithm proposed by Olshausen and Field can be realized as an auto-encoder with sparseness constraints.
In this paper we propose a method to introduce sparseness regularization into the convolutional neural network with ReLU activation functions. Usually, sparseness is introduced into neural networks to prevent an unnecessary increase of the weights (parameters) of the model. For example, weight decay adds a penalty that minimizes the squared norm of the weights. It is known that such a penalty can prevent over-fitting to the training samples and can improve the generalization ability of the trained model. In this paper, we instead introduce the sparseness on the inputs of the ReLU, namely the outputs of the linear filters. This is different from the usual sparseness regularization for neural networks.
By introducing the sparseness on the inputs of the ReLU, the unnecessary increase of the outputs of the ReLU can be reduced. This has an effect similar to Batch Normalization. The unnecessary negative inputs of the ReLU can also be reduced. It is expected that this can improve the generalization ability, similarly to weight decay.
This paper is organized as follows. Section II explains related work on the CNN, Batch Normalization, activation functions, and sparseness. Section III explains the proposed architecture with sparseness regularization. Experiments and results are shown in Section IV. Section V gives the conclusion and future work.
Related Work
A. Convolutional Neural Network
A deep CNN architecture is usually formed by a stack of distinct layers that transform the input image into class scores. Four distinct types of layers (convolution layer, pooling layer, fully connected layer, and classification layer) are commonly used. Usually, several pairs of convolution and pooling layers are repeated and then followed by a fully connected layer and a classification layer. Fig. 1 shows an example of the convolution and pooling layers.
The convolution layer is the core building block of the CNN. Each neuron in the convolution layer has a small receptive field in the input image and computes its output by convolving the receptive field with a linear filter. If we denote the weights and the bias of the $k$-th filter as $\boldsymbol{w}_{k}$ and $b_{k}$, the output of the filter for the input $\boldsymbol{x}$ in the receptive field is given as \begin{equation*}
h_{k}=\boldsymbol{w}_{k}^{T}\boldsymbol{x}+b_{k} \tag{1}
\end{equation*}
The output of the neuron is then obtained by applying the activation function $\sigma$ to the filter output as
\begin{equation*}
f(h_{k})=\sigma(h_{k}) \tag{2}
\end{equation*}
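As an illustration, the per-neuron computation of equations (1) and (2) can be sketched as follows; the function name and the loop-based sliding of the receptive field are our own illustrative choices, not from the paper.

```python
import numpy as np

def conv_layer_outputs(image, w_k, b_k, activation=np.tanh):
    """Slide the k-th linear filter over the image.
    Each output is h_k = w_k^T x + b_k (Eq. 1) for the receptive field x,
    followed by the activation function (Eq. 2)."""
    fh, fw = w_k.shape
    H, W = image.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for i in range(H - fh + 1):
        for j in range(W - fw + 1):
            x = image[i:i + fh, j:j + fw].ravel()   # receptive field
            out[i, j] = w_k.ravel() @ x + b_k       # Eq. (1)
    return activation(out)                          # Eq. (2)
```

In practice this nested loop is replaced by highly optimized convolution routines, but the arithmetic per output neuron is exactly the one above.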
In the pooling layer, down-sampling of the feature maps is performed. For example, the feature map is partitioned into a set of non-overlapping rectangular sub-regions and the maximum of each sub-region is taken as its output. This is called max pooling. It is known that the pooling layer is important to make the representation invariant to small shifts of the input image. The pooling layer also reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network. Thus it is common to periodically insert a pooling layer between successive convolution layers in a deep CNN architecture.
In addition to max pooling, several other pooling methods have been proposed, such as average pooling and L2-norm pooling. Average pooling has often been used historically but has recently fallen out of favor compared to max pooling, which has been shown to work better in practice.
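A minimal sketch of the non-overlapping max pooling described above; the helper name and the handling of ragged borders are our assumptions.

```python
import numpy as np

def max_pool(fmap, size=2):
    """Partition the feature map into non-overlapping size x size
    sub-regions and keep the maximum of each (max pooling)."""
    H, W = fmap.shape
    fmap = fmap[:H - H % size, :W - W % size]            # drop ragged border
    blocks = fmap.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))
```

Replacing `max` by `mean` in the last line gives average pooling over the same sub-regions.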
The fully connected layer and the classification layer correspond to a multi-layer Perceptron (MLP). After several pairs of convolution and pooling layers, the low-level image features are integrated via fully connected layers. Neurons in a fully connected layer have full connections to all outputs of the previous layer, the same as in the standard MLP.
The classification layer is used for the final decision and is normally the last layer in the network. It specifies how the network training penalizes the deviation between the predicted and true labels. Usually the soft-max function is used in the classification layer. The objective function of the learning is given by the negative log-likelihood of the outputs of the network as
\begin{equation*}
L=- \sum_{i\in D}\sum_{k=1}^{K}t_{ik}\log(y_{ik}), \tag{3}
\end{equation*}
where $D$ denotes the set of training samples, $K$ is the number of classes, $t_{ik}$ is the target label of the $i$-th sample for class $k$, and $y_{ik}$ is the corresponding output of the network.
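The negative log-likelihood (3) with a soft-max classification layer can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    """Soft-max over the class axis, shifted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll_loss(logits, targets):
    """Eq. (3): L = -sum_i sum_k t_ik log y_ik for one-hot targets t."""
    y = softmax(logits)
    return float(-np.sum(targets * np.log(y)))
```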
The standard way to train the network parameters, weights and biases, is the back-propagation learning which is based on the stochastic steepest descent method. The general form of the update rule is given as
\begin{align*}
\boldsymbol{w}\quad & \leftarrow\quad \boldsymbol{w}- \mu\frac{\partial L}{\partial \boldsymbol{w}} \tag{4}\\
\boldsymbol{b}\quad & \leftarrow\quad \boldsymbol{b}- \mu\frac{\partial L}{\partial \boldsymbol{b}} \tag{5}
\end{align*}
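The update rules (4)-(5) amount to the following one-line step per parameter; treating the parameters as a dictionary is our own illustrative choice.

```python
import numpy as np

def sgd_step(params, grads, mu=0.01):
    """Eqs. (4)-(5): move every parameter (weights and biases alike)
    against its gradient with learning rate mu."""
    return {name: p - mu * grads[name] for name, p in params.items()}
```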
B. Batch Normalization
Ioffe et al. [11] proposed Batch Normalization. Batch Normalization enables us to use much higher learning rates and to be less careful about initialization. It is also pointed out that Batch Normalization has a regularization effect and makes Dropout unnecessary in some cases.
The distribution of the inputs of each layer changes during the learning process because the outputs of the previous layer are influenced by the changes of the parameters in that layer. This is called internal covariate shift. It is known that the learning speed is slowed down by the internal covariate shift because we have to set a lower learning rate when such a shift becomes large. Careful parameter initialization is also necessary. This makes it notoriously hard to train models with saturating nonlinearities.
To improve the learning, we have to reduce the internal covariate shift. We can improve the learning speed by keeping the distribution of the inputs of each layer the same during the learning process. LeCun et al. [9] and Wiesler et al. [10] showed that the learning of a neural network converges faster if its inputs are whitened. Whitening the inputs of each layer makes the changes of the inputs uniform and can remove the bad effects of the internal covariate shift.
However, whitening the inputs of each layer is expensive because it requires calculating the covariance matrix of the inputs and solving the eigenvalue problem of that covariance matrix.
Instead of the whitening using the covariance matrix of the full training samples, Batch Normalization performs the dimension-wise normalization for the training samples in the mini-batch.
Let the inputs of a neuron in a layer for the mini-batch samples be $\{x_{1},\ldots,x_{m}\}$. The mean of the inputs over the mini-batch is defined as \begin{equation*}
\mu_{x}=\frac{1}{m}\sum_{i=1}^{m}x_{i}. \tag{6}
\end{equation*}
Similarly the variance of the inputs of the neuron in the layer for the mini-batch samples is defined as
\begin{equation*}
\sigma_{x}^{2}=\frac{1}{m}\sum_{i=1}^{m} (x_{i}-\mu_{x})^{2}. \tag{7}
\end{equation*}
Then we can normalize the inputs by
\begin{equation*}
\hat{x}_{i}=\frac{x_{i}-\mu_{x}}{\sqrt{\sigma_{x}^{2}+\epsilon}}. \tag{8}
\end{equation*}
This process is known as standardization in statistics, and the mean and the variance of the normalized values $\hat{x}_{i}$ become zero and one, respectively. Here $\epsilon$ is a small constant added for numerical stability.
From these standardized inputs, the output of Batch Normalization is computed by scaling and shifting as \begin{equation*}
y_{i}=\gamma\hat{x}_{i}+\beta. \tag{9}
\end{equation*}
In Batch Normalization, the parameters $\gamma$ and $\beta$ are learned together with the other parameters of the network.
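Equations (6)-(9) for a single neuron over one mini-batch can be sketched as follows; the function name and the default value of `eps` are our assumptions.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization of one neuron's inputs over a mini-batch x:
    standardize (Eqs. 6-8), then scale and shift with the learned
    parameters gamma and beta (Eq. 9)."""
    mu = x.mean()                            # Eq. (6)
    var = x.var()                            # Eq. (7)
    x_hat = (x - mu) / np.sqrt(var + eps)    # Eq. (8)
    return gamma * x_hat + beta              # Eq. (9)
```

Note that the mean and standard deviation of the output are (up to `eps`) exactly `beta` and `gamma`, regardless of the input distribution.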
Then the mean of the output of the filter $h_{k}$ over the mini-batch becomes \begin{align*}
\mu_{h_{k}}&=\frac{1}{m}\sum_{i=1}^{m}\{\boldsymbol{w}_{k}^{T}\boldsymbol{y}_{i}+b_{k}\}=\boldsymbol{w}_{k}^{T}\left(\frac{1}{m}\sum_{i=1}^{m}\boldsymbol{y}_{i}\right)+b_{k}\\
&=\beta \boldsymbol{w}_{k}^{T}\boldsymbol{1}+b_{k} \tag{10}
\end{align*}
because the mean of the standardized inputs is zero and hence the mean of each dimension of $\boldsymbol{y}_{i}$ is $\beta$, where $\boldsymbol{1}$ denotes the vector of all ones.
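Equation (10) can be checked numerically: if every dimension of the Batch Normalization outputs has mini-batch mean $\beta$, the mean of the filter output reduces to $\beta\,\boldsymbol{w}_{k}^{T}\boldsymbol{1}+b_{k}$. The helper names below are ours.

```python
import numpy as np

def make_bn_like_batch(m, d, beta, seed=0):
    """A hypothetical mini-batch whose every dimension has mean beta,
    mimicking the outputs of Eq. (9)."""
    y = np.random.default_rng(seed).normal(size=(m, d))
    return y - y.mean(axis=0) + beta

def mean_filter_output(y, w, b):
    """Mini-batch mean of h_i = w^T y_i + b, the left-hand side of Eq. (10)."""
    return float(np.mean(y @ w + b))
```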
C. Activation Functions
Historically, the sigmoid function has often been used as the activation function of the neurons in artificial neural networks such as the multi-layer Perceptron and the Boltzmann Machine.
It is known that the standard activation functions such as the sigmoid function or the hyperbolic tangent function are contractive almost everywhere, and their gradients at large input values become almost zero. Thus the updates by stochastic gradient descent become very small. This problem is known as the vanishing gradient problem. To improve the restricted Boltzmann Machine, Nair et al. [12] introduced the rectified linear unit (ReLU) defined as
\begin{equation*}
f_{ReLU}(h_{k})= \max(0, h_{k}). \tag{11}
\end{equation*}
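Eq. (11) and its derivative can be written directly in code:

```python
import numpy as np

def relu(h):
    """Eq. (11): f(h) = max(0, h), element-wise."""
    return np.maximum(0.0, h)

def relu_grad(h):
    """Derivative of Eq. (11): 1 for positive inputs, 0 otherwise,
    so the gradient does not vanish for positive activations."""
    return (h > 0).astype(float)
```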
The graph of the rectified linear unit is shown in Fig. 2 in red.
The derivative of the ReLU activation function is also shown in Fig. 3 in red.
Glorot et al. [13] showed that ReLU activation function in the hidden layers could improve the learning speed of the various deep neural networks. Now the rectified linear unit is used as the standard activation function for deep neural networks.
From Fig. 3, it can be seen that the gradient at positive values is constant and does not vanish. This means that the vanishing gradient problem can be avoided by using the ReLU activation function, which is why the ReLU can improve the learning speed of deep neural networks.
Fig. 2. The activation functions ReLU and ELU. ReLU is shown by the red line and ELU by the green line.
Further improvement of the activation function has been proposed in the literature.
As explained in the previous subsection, it is known that the learning of neural networks can be improved when their inputs and hidden unit activities are centered about zero [9], [18]. Schraudolph et al. extended this idea to the centering of error signals [19]. Raiko et al. [20] proposed a method of centering the activation of each neuron in order to keep the off-diagonal entries of the Fisher information matrix small. In the Projected Natural Gradient Descent algorithm (PRONG) proposed by Desjardins et al. [21], the activation of each neuron is implicitly centered about zero by whitening.
Recently, Clevert et al. [14] proposed an activation function with negative outputs named the Exponential Linear Unit (ELU). The ELU activation function is defined as
\begin{equation*}
f_{ELU}(h_{k})=\begin{cases}
h_{k} & (h_{k} > 0)\\
\alpha(\exp(h_{k})-1) & (h_{k}\leq 0)
\end{cases} \tag{12}
\end{equation*}
where $\alpha$ is a hyperparameter that determines the value to which the ELU saturates for negative inputs.
In a learning algorithm based on the stochastic gradient, we have to calculate the partial derivatives of the output of the activation function with respect to the weight vector $\boldsymbol{w}_{k}$.
This can be derived as
\begin{equation*}
\frac{\partial f_{ELU}(h_{k})}{\partial \boldsymbol{w}_{k}}=\frac{\partial f_{ELU}(h_{k})}{\partial h_{k}}\frac{\partial h_{k}}{\partial \boldsymbol{w}_{k}}. \tag{13}
\end{equation*}
The derivative of the ELU activation function with respect to the input $h_{k}$ is given as \begin{equation*}
\frac{\partial f_{ELU}(h_{k})}{\partial h_{k}}=\begin{cases}
1 & (h_{k} > 0)\\
f_{ELU}(h_{k})+\alpha & (h_{k}\leq 0)
\end{cases} \tag{14}
\end{equation*}
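Equations (12) and (14) can be sketched as follows; taking `alpha = 1.0` as the default is our assumption, as the paper does not fix its value here.

```python
import numpy as np

def elu(h, alpha=1.0):
    """Eq. (12): identity for h > 0, alpha*(exp(h) - 1) for h <= 0."""
    return np.where(h > 0, h, alpha * (np.exp(h) - 1.0))

def elu_grad(h, alpha=1.0):
    """Eq. (14): 1 for h > 0, f_ELU(h) + alpha for h <= 0
    (since d/dh alpha*(exp(h) - 1) = alpha*exp(h) = f_ELU(h) + alpha)."""
    return np.where(h > 0, 1.0, elu(h, alpha) + alpha)
```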
The graph of the derivative of the ELU activation function is shown in Fig. 3 in green, and that of the ReLU activation function in red. By comparing these two derivatives, it is noticed that the derivatives in the negative region differ: the derivative of the ReLU is zero for negative inputs, while that of the ELU remains positive.
Fig. 3. The derivatives of the activation functions ReLU and ELU with respect to the input of the activation function. ReLU is shown by the red line and ELU by the green line. The derivative of the sparseness term is also shown.
Although the ReLU activation function is effective to avoid the vanishing gradient problem, its outputs for negative inputs are always zero, and hence the mean of the outputs of the ReLU activation function becomes positive. This means that the ReLU activation function introduces an effect of internal covariate shift.
From Fig. 2, it is noticed that the ELU outputs negative values for negative inputs. Therefore the mean of the outputs of the ELU activation function becomes closer to zero than in the case of the ReLU. This means that the ELU activation function can soften the effect of the internal covariate shift. We think that this is one of the main reasons why the ELU activation function can improve the learning of deep neural networks.
From Fig. 3, it can also be seen that the derivative of the ELU activation function with respect to the input does not vanish for negative inputs, unlike that of the ReLU.
D. Sparseness
It is well known that the neurons in the cat's visual cortex are sensitive to small regions of the visual field (receptive fields) [15]. The receptive fields are tiled to cover the entire visual field, and each neuron acts as a local filter over the input visual stimulus. The receptive fields of simple cells in the mammalian primary visual cortex are characterized as spatially localized and oriented, and they are also selective to structure at different spatial scales. This is comparable to the basis functions of Gabor wavelets.
To understand the response properties of visual neurons from the point of view of the statistical structure of natural images, researchers have attempted to train unsupervised learning algorithms on natural images. Olshausen and Field demonstrated that a learning algorithm that attempts to find sparse linear codes for natural scenes develops a complete family of localized, oriented, bandpass receptive fields, similar to those found in the primary visual cortex [16], [17]. They also showed that the resulting sparse code possesses a higher degree of statistical independence and can provide a more efficient representation for the later stages of visual processing.
Fig. 4. Example of the architecture of a CNN with two convolution layers and one fully-connected layer.
They assumed that a local region of a natural image $I(x, y)$ can be represented by a linear superposition of basis functions $\phi_{i}(x, y)$ with coefficients $a_{i}$ as \begin{equation*}
I(x, y)=\sum_{i}a_{i}\phi_{i}(x, y). \tag{15}
\end{equation*}
This corresponds to the reconstruction mapping of an auto-encoder, and the coefficients $a_{i}$ correspond to the activities of the hidden units. Their learning algorithm minimizes an objective function that combines the reconstruction error $L$ with a sparseness term $S$ as \begin{equation*}
Q=L+\lambda S, \tag{16}
\end{equation*}
where $\lambda$ is a positive constant that controls the importance of the sparseness term.
Convolutional Neural Network with Sparse Regularization
From the discussion of the related work in Section II, the key to improving the learning of deep neural networks is to avoid both the vanishing gradient problem and the internal covariate shift.
In this paper we propose a method to introduce sparseness regularization into the convolutional neural network with ReLU activation functions, and we show that the sparseness regularization can directly soften the effect of the internal covariate shift.
A. Proposed Architecture
Fig. 4 shows an example of the architecture of CNN with two convolution layers and one fully-connected layer.
As explained in subsection II-A, each neuron of the convolution layer has a small receptive field in the input image, and the output of the receptive field is calculated by a linear filter as expressed by equation (1). In the pooling layer, the feature maps are down-sampled. The soft-max function is used in the classification layer, and the objective function of the learning is given by the negative log-likelihood of the outputs of the network as shown in equation (3).
B. Sparse Regularization
Usually, sparseness is introduced to prevent an unnecessary increase of the parameters of the model. For example, sparseness of the weights is introduced by weight decay. It is known that this can prevent overfitting to the training samples and can improve the generalization ability of the trained model.
Olshausen et al. [16] introduced the sparseness on the outputs of the hidden units. Similarly, in this paper we propose to introduce the sparseness on the inputs of the ReLU, namely the outputs of the linear filters. By introducing the sparseness on the inputs of the ReLU, the unnecessary increase of the outputs of the ReLU can be prevented and the unnecessary negative inputs of the ReLU can be reduced. This means that the proposed method has an effect similar to Batch Normalization and can also improve the generalization ability.
There are many ways to evaluate sparseness. In this paper, we evaluate the sparseness of the input $h_{k}$ of the ReLU by \begin{equation*}
S(h_{k})=\log(1+h_{k}^{2}). \tag{17}
\end{equation*}
This is one of the sparse terms introduced by Olshausen et al. [16].
Here we define the objective function of the optimization to determine the parameters of the network as
\begin{equation*}
E=L+ \lambda\sum_{k}S(h_{k}) \tag{18}
\end{equation*}
where $L$ is the negative log-likelihood (3) and $\lambda$ is a positive constant that controls the strength of the sparse regularization.
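The sparse term (17) and the regularized objective (18) can be sketched as follows; `lam` stands for the weight $\lambda$, whose value is not fixed here.

```python
import numpy as np

def sparse_term(h):
    """Eq. (17): S(h_k) = log(1 + h_k^2), summed over the filter outputs."""
    return float(np.sum(np.log1p(h ** 2)))

def regularized_loss(nll, h, lam):
    """Eq. (18): E = L + lambda * sum_k S(h_k), where h collects the
    inputs of the ReLU activation functions (the linear filter outputs)."""
    return nll + lam * sparse_term(h)
```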
To understand the effect of the sparse term in the stochastic gradient method, it is necessary to consider the shape of the derivative of the sparse term. The derivative of the sparse term with respect to the input $h_{k}$ is given as \begin{equation*}
\frac{\partial S(h_{k})}{\partial h_{k}}=\frac{2h_{k}}{1+h_{k}^{2}}. \tag{19}
\end{equation*}
The graph of the derivative of the function $S(h_{k})$ is also shown in Fig. 3.
From this figure, it is noticed that the derivative is positive for positive inputs and negative for negative inputs. This means that introducing the sparseness has the effect of pushing the inputs of the ReLU toward zero in the learning process. Thus it is expected that the unnecessary increase of the outputs of the ReLU can be prevented. This is the same effect as Batch Normalization. The unnecessary negative values of the inputs of the ReLU can also be reduced by introducing the sparseness regularization. This can improve the generalization ability of the trained network.
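The derivative (19) in code, confirming the sign behavior discussed above:

```python
import numpy as np

def sparse_term_grad(h):
    """Eq. (19): dS/dh = 2h / (1 + h^2). Positive for h > 0 and negative
    for h < 0, so the gradient step pushes the ReLU inputs toward zero."""
    return 2.0 * h / (1.0 + h ** 2)
```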
Compared with this, the derivative of the ELU activation function has the opposite sign in the negative region. This means that the negative inputs of the ELU are pushed toward large negative values. Since the ELU can output negative values for negative inputs, this is useful to reduce the mean of the outputs toward zero, but as a side effect the inputs of the ELU take unnecessary negative values. This is not good for generalization.
On the other hand, the sparse regularization on the inputs of the ReLU has both the effect of reducing the mean of the outputs of the ReLU and the effect of reducing the unnecessary negative values of the inputs of the ReLU.
We can also introduce the sparse regularization on the outputs of the ReLU. For positive values, this has the same effect as the sparse regularization on the inputs of the ReLU. However, the effect of reducing the unnecessary negative values of the inputs of the ReLU is lost because the ReLU activation function maps all negative inputs to zero.
Experiment
To confirm the effectiveness of the proposed method, we have performed experiments on the learning of the CNN using the CIFAR-10 dataset.
A. Data Sets
In the experiments, we used the standard benchmark dataset CIFAR-10. CIFAR-10 is a dataset of RGB color images for 10-class object classification. Each image in the dataset is normalized to 32 × 32 pixels.
B. CNN Architecture
To confirm the effectiveness of the sparse regularization for the CNN with the ReLU activation function in the hidden layers, we have performed the experiments using the CNN shown in Fig. 4.
This network consists of 4 layers, namely two convolution layers including max pooling, one fully-connected layer and one classifier layer. The size of convolution filter is set to
In the following experiments, the parameter
C. Effectiveness of the Sparse Term for the Inputs of ReLU
We have performed experiments to evaluate the effectiveness of the sparse regularization on the inputs of the ReLU. Table II shows the recognition rates for the test samples. In these experiments, the 50,000 samples of the dataset shown in Table I were used for training, and the sparse regularization term was introduced on the inputs of the ReLU activation functions. The recognition rates obtained by the proposed method are shown as “with sparse term”. We have run the experiments while changing the layers to which the sparseness is introduced. In Table II, “conv & fc”, “pool & fc”, and “conv & pool & fc” denote the layers to which the sparse regularization is introduced. The label “conv” means that the sparse regularization is introduced to both convolution layers, and the label “conv1 & pool1” means that it is introduced to the first convolution layer and the first pooling layer. For comparison, the recognition rates obtained by the standard error back-propagation learning algorithm with and without Batch Normalization are shown as “Batch Normalization” and “original” in Table II.
From Table II, it is noticed that the proposed approach gives better recognition rates than the error back-propagation learning algorithm both with and without Batch Normalization.
D. Detailed Comparison of Sparse Regularization
To understand the effectiveness of the sparse regularization in detail, we have performed experiments in which the layers to which the sparse term is introduced are changed. In these experiments, the sparseness was introduced on the outputs of the ReLU.
Figure 5 and Figure 6 show the learning curves of the recognition rates for the training samples and the test samples. In Figure 5, the sparse regularization term is introduced on the input of the ReLU activation function, while in Figure 6 the sparse term is introduced on the output of the ReLU. For comparison, the learning curves obtained by the original error back-propagation algorithm (shown as “origin”) and by the error back-propagation algorithm with Batch Normalization (shown as “Batch Normalization”) are also shown in these figures.
Fig. 5. Learning curves of the CNN with the ReLU activation function. In this experiment, the full dataset is used as the training samples and the sparse regularization is introduced on the input of the ReLU activation functions.
From these figures, it is noticed that the learning with Batch Normalization is faster than in the other cases, but the recognition rates for the test samples become the best when the sparse regularization is introduced to the convolution, pooling, and fully-connected layers.
Table III shows the recognition rates for the test samples when the sparse term is introduced on the outputs of the ReLU activation function. As before, the recognition rates obtained by the trained CNN are shown in the third column. The upper two rows are the results obtained by the original error back-propagation learning with and without Batch Normalization. The other rows show the results obtained by the error back-propagation with the sparse regularization. The layers to which the sparse regularization is introduced are denoted such as “conv” or “conv1 & pool1”.
From this table, it is noticed that the recognition rates for all cases using the sparse regularization are better than the results of the original error back-propagation with and without Batch Normalization. The top recognition rate is achieved when the sparse regularization is introduced to all layers. It is about 7% and 9% better than the original error back-propagation learning with and without Batch Normalization, respectively.
Other high recognition rates are achieved when the sparse regularization is introduced to all pooling layers and the fully-connected layer, or to the first convolution layer, the second pooling layer, and the fully-connected layer.
Fig. 6. Learning curves of the CNN with the ReLU activation function. In this experiment, the full dataset is used as the training samples and the sparse regularization is introduced on the output of the ReLU activation functions.
However, the result obtained when the sparse regularization is introduced only to the fully-connected layer is not as good as the other cases. This means that the sparse regularization should be introduced both to the feature extraction layers, such as the convolution layers or the pooling layers, and to the classifier (the fully-connected layer).
By comparing Tables II and III, it is noticed that the recognition accuracies obtained by the sparse regularization on the inputs of the ReLU activation function are better than those obtained when the sparse term is introduced on the outputs of the ReLU activation function. We think the reason is that the gradient of the sparse term for negative values always becomes zero when the sparse term is introduced on the outputs of the ReLU activation functions, while it remains non-zero when the sparse term is introduced on the inputs. This suggests that the sparse term should be introduced on the input of the ReLU.
Conclusion
In this paper we proposed to introduce the sparse regularization on the inputs of the ReLU of the CNN. It was shown that the sparse regularization has the effect of pushing the inputs of the ReLU toward zero in the learning process, so that the unnecessary increase of the outputs of the ReLU can be prevented. This is a similar effect to Batch Normalization. The unnecessary growth of the negative values of the inputs of the ReLU can also be reduced by introducing the sparse regularization on the inputs of the ReLU. Thus it is expected that the generalization ability of the trained CNN can be improved. Through detailed experiments using the CIFAR-10 dataset, the effectiveness of the proposed approach was confirmed.
We think that the key to improving the learning of the CNN is to solve three problems: the vanishing gradient, the unnecessary growth of the inputs, and the bias shift. The ReLU solved the vanishing gradient problem but introduced the problem of the unnecessary growth of the inputs. The sparse regularization proposed in this paper solves the problem of the unnecessary growth of the inputs. This is probably the reason why the combination of the ReLU and the sparse regularization gives better results. On the other hand, the ELU can solve the problems of the vanishing gradient and the bias shift. The effectiveness of the combination of the ELU and the sparse regularization should be investigated.
ACKNOWLEDGMENT
This work was partly supported by JSPS KAKENHI Grant Number 16K00239.








